On the complexity of non-binary tree reconciliation with endosymbiotic gene transfer

Reconciling a non-binary gene tree with a binary species tree can be done efficiently in the absence of horizontal gene transfers, but becomes NP-hard in the presence of gene transfers. Here, we focus on the special case of endosymbiotic gene transfers (EGT), i.e. transfers between the mitochondrial and nuclear genome of the same species. More precisely, given a multifurcated (non-binary) gene tree with leaves labeled 0 or 1 depending on whether the corresponding genes belong to the mitochondrial or nuclear genome of the corresponding species, we investigate the problem of inferring a most parsimonious Duplication, Loss and EGT (DLE) Reconciliation of any binary refinement of the tree. We present a general two-step method: ignoring the 0–1 labeling of leaves, output a binary resolution minimizing the Duplication and Loss (DL) Reconciliation cost and then, for such a resolution, assign a known number of 0s and 1s to the leaves in a way that minimizes EGT events. While the first step corresponds to the well-studied non-binary DL-Reconciliation problem, the complexity of the label assignment problem corresponding to the second step was unknown. We show that this problem is NP-complete, even when the tree is restricted to a single polytomy, and even if transfers can occur in only one direction. We present a general algorithm solving each polytomy separately, which is shown to be optimal for a unitary cost of operations, and a polynomial-time algorithm for solving a polytomy in the special case where genes are specific to a single genome (mitochondrial or nuclear) in all but one species. This work represents the first algorithmic study of reconciliation with endosymbiotic gene transfers in the case of a multifurcated gene tree.


Introduction
Reconciliation is the process of embedding a gene family tree into a species tree (i.e. reconstructing a mapping between the gene tree and the species tree) to explain how the gene family evolved inside the species tree according to the gene tree model, through evolutionary events modifying gene contents in genomes, such as losses, duplications or horizontal gene transfers (HGTs). This allows deciphering the orthology (divergence through speciation), paralogy (divergence through duplication) or xenology (divergence through HGT) relation between genes, which has important implications for understanding the functional specificity of gene copies. For this purpose, the most critical part is the construction of a "good" gene tree, i.e. a gene tree reflecting the true evolution of the nucleotide or amino acid sequences of genes. In fact, as shown in many studies [1], the result of a reconciliation model strongly depends on the considered trees. For example, due to potential errors in the trees, some of the plant datasets analysed in [2] produced unrealistic evolutionary histories with unexpectedly high numbers of gene duplications and losses.
Unfortunately, for many reasons related to sequence alignment, limitations of the considered phylogenetic method or issues with the sequence dataset (not enough mutations or too many, both cases leading to absence of signal), gene trees are almost never inferred with absolute certainty. As phylogenetic reconstruction methods are usually accompanied with statistical evaluations on branches, a solution for removing ambiguities in a tree is collapsing its weakly supported branches, leading to a non-binary tree (tree with multifurcated nodes, also called polytomies). The problem then becomes one of simultaneously finding a binary refinement and optimal reconciliation of the multifurcated tree, more precisely, inferring an optimal evolutionary scenario leading to a binary refinement of the tree. This strategy has been applied, for example, to infer the evolution of the gene families responsible for alkaloid accumulation in plants [3].
Reconciling a non-binary gene tree with a binary species tree can be done efficiently in the absence of HGTs (a review can be found in [4]). As far as we know, the most efficient algorithm for minimizing a Duplication/Loss (DL) distance is PolytomySolver [5], which handles unit costs in linear time, improves the best complexity of previous algorithms for the general DL cost model by a linear factor and makes it possible to account for various evolutionary rates across the branches of a species tree. However, the problem becomes NP-hard in the presence of gene transfers [6]. Various heuristics have been developed for the DTL (Duplication, Transfer, Loss) reconciliation of a non-binary gene tree with a binary species tree [7][8][9].
In this paper, we focus on the particular case of DTL non-binary gene tree reconciliation where transfers can only move genes between the mitochondrial and nuclear genome of the same species, called endosymbiotic gene transfers. In fact, it is well known that episodes of such gene transfers, mainly from the mitochondria to the nucleus, have marked eukaryote evolution since an initial endosymbiotic event integrating an α-proteobacterial genome into a eukaryotic cell, which is known to be at the origin of all extant mitochondria. Such events resulted in a significant reduction of the mitochondrial genome. Understanding how both nuclear and mitochondrial genomes have been shaped by gene loss, duplication and transfer is important to shed light on a number of open questions regarding the origin, evolution, and characteristics of the gene coding capacity of eukaryotes, but also on the rooting of the eukaryotic tree.
From a computational point of view, EndoRex [2] is the first algorithm developed for integrating such endosymbiotic events in a reconciliation model. Given a gene family with gene copies labeled by 0 or 1 depending on whether they are encoded in the mitochondrial or nuclear genome of a given species, a binary gene tree for the gene family and a binary species tree for the considered species, EndoRex infers a most parsimonious scenario of duplications, losses and endosymbiotic gene transfers (EGT) explaining the gene tree given the species tree. It is an exact polynomial-time algorithm, which can be used to output all minimum cost solutions, for arbitrary costs of operations.
Here, we explore the case of a non-binary gene tree. More precisely, given a multifurcated gene tree for a gene family with 0-1 labeled genes (leaflabels of the gene tree), the problem consists in inferring a most parsimonious duplication, loss and EGT scenario leading to a binary refinement of the tree. Our method is in two steps: ignoring the 0-1 labeling of the gene tree leaves, output all resolutions minimizing the DL-Reconciliation cost and then, for each resolution (i.e. binary tree), assign a known number of 0s and 1s to the leaves in a way minimizing EGT events.
Step one can be done efficiently as recalled above. Therefore, we focus on the second step which consists in assigning a 0-1 labeling to the nodes of a binary tree, in a way minimizing the considered evolutionary distance. We show in "Complexity of the dle-binl and dle-binl1 Problems" and "The one-direction DLE-reconciliation problem" sections that this problem is NP-complete, even when the tree is restricted to a single multifurcated node (also called polytomy) and, surprisingly, even if transfers can occur in a single direction (e.g. from the mitochondrial to the nuclear genome). It is polynomial in the very restricted case of a binary tree obtained as an optimal refinement (step 1) of a star-tree, and with each leaflabel present at most a fixed number of times. We then, in "A general algorithm for the dle-binl problem" section, present a general algorithm solving each polytomy separately, which is shown optimal for a unitary cost of operations.
Except for species conserving the traces of an ancestral eukaryotic origin, few genes are expected to reflect an intermediate endosymbiotic integration of the mitochondrial gene content to the nucleus, with gene copies in both the nuclear and mitochondrial genome. This is the case of the eukaryotes with complete mitochondrial genomes explored in [10] (statistics summarized in [2]): among the 2,486 species, only 52 species have mitochondrial-encoded genes also present in the nuclear genome. This motivates "An exact algorithm for the one-species version of the dle-binl1 problem" where we develop a polynomial-time algorithm for the b-labeling problem in the special case where, in each polytomy, genes are specific to a single genome (mitochondrial or nuclear) in all but one species. We first begin, in the next section, by formally defining our problems.

Preliminaries, evolutionary model and definitions
All trees are considered rooted. Given a tree T, we denote by r(T) its root, by V(T) its set of nodes and by L(T) ⊆ V(T) its leafset. We call n = |L(T)| the size of T.
A node x is a descendant of y if x is on the path from y to a leaf of T, and an ancestor of y if x is on the path from r(T) to y; x is a strict descendant (resp. strict ancestor) of x′ if it is a descendant (resp. ancestor) of x′ different from x′. Moreover, x is the parent of y ≠ r(T), denoted p(y), if it directly precedes y on this path. In this latter case, y is a child of x. We denote by E(T) the set of edges of T, where an edge is represented by its two terminal nodes (x, y), with x being the parent of y. More generally, if x is an ancestor of y, (x, y) denotes the path between x and y. The subtree of T rooted at x (i.e. containing all the descendants of x) is denoted T[x]. The lowest common ancestor (LCA) in T of a subset L′ of L(T), denoted lca_T(L′), is the ancestor common to all the nodes in L′ which is the most distant from the root.
An internal node (a node which is not a leaf) is said to be unary if it has a single child, binary if it has two children, and a polytomy if it has more than two children. Moreover, a star-tree is a tree with a single internal node. We will denote by x_l and x_r the two children of a binary node. The node x_l (resp. x_r) is called the sibling of x_r (resp. x_l).
A tree R is an extension of a tree T if it is obtained from T by grafting unary or binary nodes in T, where grafting a unary node x on an edge (u, v) consists in creating a new node x, removing the edge (u, v) and creating two edges (u, x) and (x, v), and in the case of grafting a binary node, also creating a new leaf y and an edge (x, y). In the latter case, we say that y is a grafted leaf. Moreover, given […]
A species tree for a set of species Σ is a tree S with a bijection between L(S) and Σ. In this paper, we assume that the species tree S for a given set of species is known, rooted and binary. For example, the tree S in Fig. 1.(1) is a species tree for the set of species Σ = {A, B, C}. A gene family is a set Γ of genes where each gene x ∈ Γ belongs to a given species s_L(x) of Σ. A tree G is a gene tree for a gene family Γ if its leafset is in bijection with Γ. We write ⟨G, s_L⟩ when each leaf of G is meant to be fully identified by its species labeling, i.e. the species s_L(x) it belongs to (e.g. gene tree in Fig. 1.(3); lowercase letters represent genes in the genome represented by the same letter in uppercase).
In this paper, we will consider an additional b-labeling for a gene x: b_L(x) = 0 if x belongs to the mitochondrial genome of s_L(x), and b_L(x) = 1 if x belongs to the nuclear genome of s_L(x). We write ⟨G, s_L, b_L⟩ when we want to specify that each leaf of G is fully identified by these two labels (e.g. trees (2) and (4) in Fig. 1). To summarize, G, ⟨G, s_L⟩ and ⟨G, s_L, b_L⟩ are three notations for a gene tree, the last two specifying the way the leaves of G are identified. Later, we will need to define a labeling for the internal nodes of G.
A binary tree is a tree with all internal nodes being binary. If internal nodes have one or two children, then the tree is said to be partially binary. A multifurcated tree is a tree containing at least one polytomy. For example, in Fig. 1, the tree (2) is a multifurcated tree with two polytomies.
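The node and tree categories above can be made concrete with a small sketch. The representation is an assumption made only for illustration: a tree is a hypothetical `children` dictionary mapping each internal node to the list of its children, and a polytomy is taken to be a node with more than two children.

```python
def node_kind(x, children):
    """Classify a node by its number of children."""
    k = len(children.get(x, []))
    if k == 0:
        return "leaf"
    if k == 1:
        return "unary"
    if k == 2:
        return "binary"
    return "polytomy"  # assumption: more than two children


def tree_kind(root, children):
    """Return the most specific category among
    'binary', 'partially binary' and 'multifurcated'."""
    kinds = set()
    stack = [root]
    while stack:
        x = stack.pop()
        kind = node_kind(x, children)
        if kind != "leaf":
            kinds.add(kind)
        stack.extend(children.get(x, []))
    if "polytomy" in kinds:
        return "multifurcated"
    if "unary" in kinds:
        return "partially binary"
    return "binary"


# Hypothetical toy trees (not the trees of Fig. 1).
star_tree = {"r": ["a", "b", "c"]}
binary_tree = {"r": ["u", "c"], "u": ["a", "b"]}
partially_binary_tree = {"r": ["u", "c"], "u": ["a"]}
```

A binary tree also satisfies the definition of a partially binary tree; the sketch simply reports the most specific category.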
As for a multifurcated tree ⟨G^M, s^M_L⟩, a binary refinement ⟨G, s_L⟩ and the set of binary refinements B(⟨G^M, s^M_L⟩) are defined in the same way, just ignoring the b-labeling.
In Fig. 1, the tree in (4) is a binary refinement of the tree in (2), and the tree in (3) is the same binary refinement, just ignoring the 0-1 labeling of leaves. We need a final notation. Let X ⊆ L(⟨G, s_L, b_L⟩). The count matrix Count(X) for X is a |Σ| × 2 matrix defined as follows:
Count(X)[σ, 0] = number of genes g ∈ X such that s_L(g) = σ and b_L(g) = 0
Count(X)[σ, 1] = number of genes g ∈ X such that s_L(g) = σ and b_L(g) = 1

DLE reconciliation
Inside the species' genomes, genes undergo Speciation (Spe) when the species to which they belong do, but also Duplication (Dup), i.e. the creation of a new gene copy, Loss of a gene copy, and transfer when a gene is transmitted from a source to a target genome. In this paper, we only consider endosymbiotic gene transfers, denoted EGT, i.e. the special case of transfers only allowing the transmission of genes from the mitochondrial genome to the nuclear genome of the same species, or vice-versa. If the transmission of a gene from a genome A to a genome B is accompanied by the loss of the gene in A, we refer to the event as an EGTL (EGT-Loss) event.
We are now ready to recall the definition of a DLE-Reconciliation as introduced in [2].
Definition 2 (DLE-Reconciliation) Let ⟨G, s_L, b_L⟩ be a rooted binary gene tree for a gene family Γ and S be a rooted binary species tree for the species the genes belong to. A DLE-Reconciliation of ⟨G, s_L, b_L⟩ with S (or simply DLE-Reconciliation if no ambiguity) is a quadruplet ⟨R, s, b, e⟩ where R is a partially binary extension of G, s is an extension of s_L from V(R) to V(S), b is an extension of b_L from V(R) to {0, 1}, and e is an event labeling of the internal nodes of R, such that:
1 Each unary node x with a single child y is such that e(x) = EGTL, s(x) = s(y) and b(x) ≠ b(y); x is an EGTL event with source genome σ^{b(x)} and target genome σ^{b(y)}, where σ = s(x) (or equivalently s(y)).
2 For each binary node x of R with two children x_l and x_r, one of the following cases holds: (a) s(x_l) and s(x_r) are the two children of […] which case e(x) = EGT; let y be the element of {x_l, x_r} verifying b(x) ≠ b(y), then e(x) is an EGT with source genome σ^{b(x)} and target genome σ^{b(y)}.
Grafted leaves in the extension R correspond to gene losses. As R is an extension of G, each node in G has a corresponding node in R. In particular, the s, b and e labelings on R induce s, b and e labelings on the nodes of G. The differences between G and R are additional binary nodes with a child being a grafted leaf (a loss), and unary nodes corresponding to EGTL events.
A DL-Reconciliation of ⟨G, s_L⟩ is defined as in Definition 2, ignoring the b-labeling, i.e. it is a tuple ⟨R, s, e⟩ where R is an extension of G. For example, in […]
Optimal reconciliation: Let c be a function attributing a cost to each event in DLE = {Spe, Dup, Loss, EGT, EGTL}. As is usually the case, we assume a 0 cost for speciations and positive costs for all the other events. Moreover, we assume that c(Dup) < c(EGT) + c(EGTL), as otherwise duplications could never be inferred in a most parsimonious reconciliation. Similarly, we assume c(EGT) < c(Dup) + c(EGTL) to allow for EGTs and c(EGTL) < c(EGT) + c(Loss) to allow for EGTLs.
Given a DLE-Reconciliation ℛ = ⟨R, s, b, e⟩ (resp. DL-Reconciliation ⟨R, s, e⟩), the cost C(ℛ) of ℛ is the sum of costs of the events labeling the internal nodes of R plus the sum of costs of the losses, i.e.

$$C(\mathcal{R}) = \sum_{x \in V(R) \setminus L(R)} c(e(x)) + |L(R)^{Loss}| \cdot c(Loss)$$

where |L(R)^Loss| is the number of losses in ℛ. In this paper, we seek a most parsimonious reconciliation, i.e. a reconciliation of minimum cost, also called an optimal reconciliation. We denote by DLE(G, S) (resp. DL(G, S)) the cost of an optimal DLE-Reconciliation (resp. DL-Reconciliation).
From now on, we denote by δ, λ, τ and ρ, respectively, the cost of a duplication, a loss, an EGT and an EGTL event. The cost function is said to be unitary when δ = λ = τ = ρ.
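As a minimal sketch of the cost model, the following Python code checks the three assumptions made on the cost function c and evaluates the cost of a reconciliation from the event labels of its internal nodes and its number of losses. The dictionary-based representation and the function names are hypothetical, introduced here only for illustration.

```python
from collections import Counter


def check_cost_assumptions(c):
    """True if c has a zero speciation cost, positive costs for the
    other events, and satisfies the three strict inequalities."""
    return (
        c["Spe"] == 0
        and all(c[e] > 0 for e in ("Dup", "Loss", "EGT", "EGTL"))
        and c["Dup"] < c["EGT"] + c["EGTL"]
        and c["EGT"] < c["Dup"] + c["EGTL"]
        and c["EGTL"] < c["EGT"] + c["Loss"]
    )


def reconciliation_cost(internal_events, n_losses, c):
    """Sum of the costs of the events labeling the internal nodes,
    plus the number of losses times the cost of a loss."""
    counts = Counter(internal_events)
    return sum(c[e] * k for e, k in counts.items()) + n_losses * c["Loss"]


# Unitary cost: all non-speciation events cost the same.
unit_cost = {"Spe": 0, "Dup": 1, "Loss": 1, "EGT": 1, "EGTL": 1}

# Hypothetical reconciliation: five internal nodes and two losses.
example_events = ["Spe", "Dup", "EGT", "EGTL", "Spe"]
example_cost = reconciliation_cost(example_events, 2, unit_cost)
```

Under the unitary cost, the example evaluates to 1 + 1 + 1 for the duplication, EGT and EGTL nodes, plus 2 for the losses.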
The following lemma makes the link between an optimal DLE-Reconciliation and the optimal DL-Reconciliation.

Lemma 1
Any optimal DLE-Reconciliation of ⟨G, s_L, b_L⟩ can be obtained from the optimal DL-Reconciliation of ⟨G, s_L⟩ by possibly adding unary nodes and possibly converting duplications into EGTs.
Proof Suppose, by contradiction, that there exists an optimal DLE-Reconciliation R_DLE of ⟨G, s_L, b_L⟩ that cannot be obtained from the optimal DL-Reconciliation by possibly adding unary nodes and possibly converting duplications into EGTs. Consider now the DL-Reconciliation R_DL obtained from R_DLE by removing all unary nodes, converting all EGTs into duplications and ignoring the binary assignment of genes. Let x be a duplication of R_DL with at least one loss as a child. By construction of R_DL, x is either a duplication or an EGT node in R_DLE.
1 If x is a duplication in R_DLE, then removing this duplication and one of its loss children and connecting its other child to its parent (if x is the root, then its other child becomes the new root) would result in a DLE-Reconciliation R′_DLE whose cost is lower than C(R_DLE). This contradicts the fact that R_DLE is optimal.
2 If x is an EGT in R_DLE, then replacing this EGT by an EGTL node and removing its loss child from R_DLE would result in a DLE-Reconciliation R′_DLE whose cost is lower than C(R_DLE) (because we assume c(EGT) + c(Loss) > c(EGTL)). This also contradicts the fact that R_DLE is optimal.
Therefore, R_DL has no duplication node with a loss as a child and thus all duplication nodes of R_DL have a corresponding node in G. Let R^*_DL be the optimal DL-Reconciliation of G with S. Note that R_DL cannot have fewer duplication nodes than R^*_DL, as the optimal DL-Reconciliation has the minimum number of duplication nodes possible for a DL-Reconciliation [11]. As each duplication node in R_DL has a corresponding node in G, it also has a corresponding node in R^*_DL. If each such duplication node in R_DL is also a duplication node in R^*_DL, then R_DL = R^*_DL, which is in contradiction with the hypothesis. Therefore, there is at least one duplication node x in R_DL whose corresponding node in R^*_DL is a speciation. Both children of x in R_DL must have a loss as a child, as otherwise x would be a speciation. Similarly to the previous case, x is either a duplication or an EGT in R_DLE, and removing the loss children of its two children (and possibly adding an EGTL event if needed) results in a DLE-Reconciliation R′_DLE with x transformed into a speciation, and thus C(R′_DLE) < C(R_DLE). This is a contradiction, as we supposed R_DLE to be optimal.
Recall that the optimal DL-Reconciliation is unique and s_DL is the LCA-mapping [4], i.e. for each node x of R_DL corresponding to a node of G, s_DL(x) = lca_S({s_L(g) : g ∈ G[x]}). Moreover, as s_DLE is an extension of s_DL and R_DLE is an extension of R_DL, for each node x of G, s_DLE(x) = s_DL(x). See, for example, the optimal DLE-Reconciliation in Fig. 1.(6), obtained from the optimal DL-Reconciliation (5) by converting two duplication nodes into EGT nodes and adding an EGTL unary node on the terminal edge leading to the gene in genome C.
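The LCA-mapping recalled above can be sketched as follows. This is an illustrative, non-optimized sketch (it is not the implementation of any cited tool): the species tree is given by a hypothetical `species_parent` dictionary, the gene tree by a hypothetical `gene_children` dictionary, and the toy trees are assumptions, not the trees of Fig. 1.

```python
def ancestors(node, parent):
    """Path from node up to the root of the tree (both inclusive)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(a, b, parent):
    """Lowest common ancestor of a and b in the tree encoded by parent."""
    anc_a = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in anc_a:
            return node
    raise ValueError("nodes do not belong to the same tree")


def lca_mapping(gene_root, gene_children, s_leaf, species_parent):
    """Map every gene tree node x to the LCA, in the species tree,
    of the species of the leaves below x."""
    s = {}

    def visit(x):
        kids = gene_children.get(x, [])
        if not kids:
            s[x] = s_leaf[x]
            return
        for k in kids:
            visit(k)
        mapped = s[kids[0]]
        for k in kids[1:]:
            mapped = lca(mapped, s[k], species_parent)
        s[x] = mapped

    visit(gene_root)
    return s


# Hypothetical species tree ((A, B), C) and gene tree (((a1, a2), b1), c1).
species_parent = {"A": "AB", "B": "AB", "AB": "ABC", "C": "ABC"}
gene_children = {"g": ["u", "c1"], "u": ["v", "b1"], "v": ["a1", "a2"]}
s_leaf = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
s_lca = lca_mapping("g", gene_children, s_leaf, species_parent)
```

In this toy example, the node v, whose two children are both mapped to A, is itself mapped to A, which is the situation in which a DL-Reconciliation infers a duplication.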
Given a DLE-Reconciliation R DLE , removing an even number of consecutive EGTL nodes can only lead to a more parsimonious DLE-Reconciliation. Therefore, we assume that a reconciliation does not involve such nodes. This assumption is used in the following definition of a compressed reconciliation.

Definition 3 (Compressed reconciliation)
A compressed DLE-Reconciliation of ⟨G, s_L, b_L⟩ is a tuple ⟨G, s, b, e_V, e_E⟩ obtained from a DLE-Reconciliation ⟨R, s, b, e⟩ of ⟨G, s_L, b_L⟩, where e_V is simply e restricted to the nodes of G and e_E is a P/A (Presence/Absence) labeling of the edges of G indicating the presence or absence of an EGTL node on that edge, i.e. obtained as follows: Let G′ be the tree obtained from R by removing grafted leaves and their parental nodes (i.e. ignoring losses). For each edge (x, y) of G, let x′, y′ be the corresponding nodes in G′ (G′ differs from G only by unary nodes). Then: e_E(x, y) = P if the path (x′, y′) of G′ contains a unary node (an EGTL event), and e_E(x, y) = A otherwise.
A compressed DL-Reconciliation of ⟨G, s_L⟩ is defined similarly, ignoring b and the e_E labeling. For example, in Fig. 1, the compressed DL-Reconciliation of (5) is simply that tree R(⟨G, s_L⟩) where we ignore losses, i.e. dotted lines. Moreover, the compressed DLE-Reconciliation of (6) is that tree R(⟨G, s_L, b_L⟩) where we ignore losses and replace the unary node (EGTL) on the branch leading to c_1 by a label on that branch.
For a compressed DLE-Reconciliation R^c = ⟨G, s, b, e_V, e_E⟩, denote by |e_V^EGT| the number of EGT nodes, by |e_E| the number of edges labeled P, i.e. the number of EGTL events, and define the cost of R^c as […]

Lemma 2 From a compressed DLE-Reconciliation
Let ℛ_DL = ⟨R_DL, s, e_DL⟩ be the optimal DL-Reconciliation of G with S. We construct a DLE-Reconciliation ℛ = ⟨R_DLE, s_DLE, b_DLE, e_DLE⟩ from ℛ_DL and R^c in linear time as follows:
• R_DLE is obtained from R_DL by grafting a unary node (EGTL) on the edge (p(x), x) (in R_DL) for each node […]
As ℛ is constructed from ℛ_DL, it is easy to see that the species labeling of the nodes of R_DLE is correct. By construction, the b-labeling of the nodes of R_DLE is also correct, as the b-labeling b is assumed correct (thus the b-labeling of the nodes […]
Notice that there are |e_E| EGTL events and |e_V^EGT| EGT events in ℛ. Also, the number of loss events in ℛ is the same as the number of loss events in ℛ_DL. Let |e^DL_Dup| be the number of duplication nodes in the DL-Reconciliation. As an EGT event in ℛ may only occur on a node that is a duplication in ℛ_DL, there are |e^DL_Dup| − |e_V^EGT| duplication events in ℛ. Therefore, the cost of ℛ is:

$$C(\mathcal{R}) = \left(|e^{DL}_{Dup}| - |e_V^{EGT}|\right) c(Dup) + |e_V^{EGT}| \, c(EGT) + |e_E| \, c(EGTL) + |L(R_{DL})^{Loss}| \, c(Loss)$$

Proof For a compressed DLE-Reconciliation R^c = ⟨G, s, b, e_V, e_E⟩, a DLE-Reconciliation leading to R^c, of the same cost as R^c, can be found in linear time by the constructive proof of Lemma 2. In particular, a DLE-Reconciliation ℛ can be obtained from an optimal compressed DLE-Reconciliation R^c, and this DLE-Reconciliation ℛ is necessarily optimal. In fact, from Lemma 1, any optimal DLE-Reconciliation R_DLE can be obtained from the optimal DL-Reconciliation. Then, by construction of R_DLE, […], and thus C(ℛ) ≤ C(R_DLE), but as R_DLE is by definition an optimal DLE-Reconciliation, we have C(ℛ) = C(R_DLE) and thus ℛ is also optimal.
The problem of finding an optimal DLE-Reconciliation is thus equivalent to that of finding an optimal compressed DLE-Reconciliation.
By default, we will consider compressed DLE-Reconciliations unless we explicitly state that the considered reconciliation is non-compressed.

Problem statements
The general problem of simultaneously refining and reconciling a multifurcated gene tree under the DLE evolutionary model is formulated as follows.
DLE Non-binary Reconciliation problem: Input: A binary species tree S, a multifurcated gene tree ⟨G^M, s^M_L, b^M_L⟩ and a cost function c on DLE. […]
The DL Non-binary Reconciliation problem is simply the restriction of the previous problem to DL-Reconciliation.
The complexity of the DLE Non-binary Reconciliation problem is unknown. Our resolution method for this problem operates in two steps:
Resolution method:
Step 1: Find a binary refinement ⟨G, s_L⟩ of ⟨G^M, s^M_L⟩ leading to an optimal DL-Reconciliation.
Step 2: Given the binary tree ⟨G, s_L⟩ obtained above, find a b-labeling of G leading to an optimal DLE-Reconciliation.
Although not guaranteed to be optimal, this method is a natural greedy heuristic for the DLE Non-binary Reconciliation problem. In fact, as stated in Lemma 1, an optimal DLE binary reconciliation (result of Step 2) is obtained from a DL binary reconciliation (result of Step 1) by simply converting some duplication nodes into EGT nodes and adding EGTL labels on branches. Moreover, Step 1 can be solved efficiently using existing algorithms such as PolytomySolver [5].
Having a binary refinement ⟨G, s_L⟩ of ⟨G^M, s^M_L⟩, the problem then reduces (Step 2) to finding a b-labeling for G allowing for an optimal DLE-Reconciliation.
Notice that, in contrast to the species labeling s_L, the b-labeling b_L of the leaves of G is unknown after Step 1. The problem is therefore not reduced to extending a b_L labeling to the internal nodes, but rather consists in finding an appropriate labeling b_L of the leaves as well. This labeling is constrained by the b-labeling of G^M, as formulated in the next lemma, which is directly deduced from the definition of a binary refinement (Definition 1).
Therefore, in addition to ⟨G, s_L⟩ corresponding to a binary refinement of ⟨G^M, s^M_L⟩, the input of Step 2 also includes a set of constraints induced by the b-labeling of V(G^M). These constraints can be represented as a set of […] (Fig. 2.(1)).

Definition 4
Given a binary tree ⟨G, s_L⟩ and a b-constraint labeling (M, I) for G, a labeling b_L is said to be consistent with (M, I) if, for any x ∈ I, […]
Moreover, recall from Lemma 1 and Definition 3 that an optimal DLE-Reconciliation of a tree ⟨G, s_L, b_L⟩ is obtained from an optimal DL-Reconciliation of ⟨G, s_L⟩ by possibly converting duplication nodes to EGTs and adding a P/A labeling on edges. Moreover, as noted before, the s labeling of an optimal DLE-Reconciliation should be the LCA-mapping. We denote it s_lca.
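To illustrate the count matrix and the notion of a b-constraint, here is a hedged Python sketch. It assumes one particular reading of consistency, namely that for every constrained node x ∈ I the count matrix of the leaves below x must equal the table M(x); this reading, the dictionary representation and all names are assumptions made for illustration.

```python
def count_matrix(leaves, s_leaf, b_leaf, species):
    """Count(X): for each species, the number of genes of X labeled 0
    (mitochondrial) and the number labeled 1 (nuclear)."""
    count = {sigma: [0, 0] for sigma in species}
    for g in leaves:
        count[s_leaf[g]][b_leaf[g]] += 1
    return count


def leaves_below(x, children):
    """Leaves of the subtree rooted at x."""
    kids = children.get(x, [])
    if not kids:
        return [x]
    return [leaf for k in kids for leaf in leaves_below(k, children)]


def is_consistent(b_leaf, M, children, s_leaf, species):
    """Assumed reading: b_leaf is consistent with (M, I) if, for every
    x in I (the keys of M), Count(L(G[x])) equals M[x]."""
    return all(
        count_matrix(leaves_below(x, children), s_leaf, b_leaf, species) == M[x]
        for x in M
    )


# Hypothetical instance: I = {"r"}, i.e. a single constrained node.
species = ["A", "B", "C"]
children = {"r": ["u", "c1"], "u": ["a1", "a2"]}
s_leaf = {"a1": "A", "a2": "A", "c1": "C"}
M = {"r": {"A": [1, 1], "B": [0, 0], "C": [0, 1]}}

good_labeling = {"a1": 0, "a2": 1, "c1": 1}
bad_labeling = {"a1": 0, "a2": 0, "c1": 1}
```

Here the constraint at the root asks for one mitochondrial and one nuclear copy in species A and one nuclear copy in species C, so the first labeling satisfies it and the second does not.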
The main problem (Step 2) can thus be defined as follows. See an example in Fig. 2 where (1) is the input of the DLE-BinL problem and (2) is its output.
DLE-BinL Problem: Input: A binary tree ⟨G, s_L⟩, a b-constraint (M, I) and a species tree S; Output: […]
Notice that, from Lemma 1, in the case of a unitary cost, the problem is equivalent to finding a minimum number of added EGTL events.
We call DLE-BinL1 the DLE-BinL problem where I is restricted to the root of G (which corresponds to considering a star-tree as the initial multifurcated tree).

Complexity of the DLE-BinL and DLE-BinL1 problems
In this section, the considered cost is unitary; the complexity results then extend naturally to a general cost. The DLE-BinL problem in its decision version is defined below; the decision version of DLE-BinL1 is defined similarly.
DLE-BinL decision version: Input: A binary tree ⟨G, s_L⟩, a b-constraint (M, I), a species tree S and an integer Cost; […]
First observe that the DLE-BinL decision problem is in NP. In fact, given a DLE-Reconciliation ⟨G, s_lca, b, e_V, e_E⟩ of ⟨G, s_L, b_L⟩, we can compute the cost of the DLE-Reconciliation (to verify that it is less than or equal to Cost) and verify that the b-labeling b_L is consistent with (M, I) in polynomial time by traversing the tree G.
According to the Resolution method presented in the "Problem statements" section, the input of Step 2 (finding an optimal DLE-Reconciliation of a binary gene tree) is not an arbitrary binary tree, but rather a binary refinement of an initial multifurcated tree ⟨G^M, s^M_L⟩, leading to an optimal DL-Reconciliation. In this section, we show that the DLE-BinL problem is NP-complete even with this requirement, in all but one very constrained version of the problem.

Complexity of the DL-DLE-BinL1 problem
We first show, by reduction from the Weighted Monotone one-in-three-satisfiability problem (Weighted Monotone 1-in-3-SAT problem), that the DL-DLE-BinL1 decision problem is NP-complete. We can then deduce that DL-DLE-BinL is also NP-complete, as well as the more general DLE-BinL problem.
As the DLE-BinL decision problem is in NP, the DL-DLE-BinL1 decision problem is also in NP. The Weighted Monotone 1-in-3-SAT problem is defined as follows (monotone meaning that there is no negation of variables in the clauses).
Weighted Monotone 1-in-3-SAT: […] with {x, y, z} ⊆ ℒ and a positive integer n (n ≤ m); Question: Is there a truth assignment with exactly n variables set to True satisfying C such that exactly one literal in each clause is set to True?
As the Monotone 1-in-3-SAT problem is NP-complete, the Weighted Monotone 1-in-3-SAT problem is also NP-complete.
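For intuition about the source problem of the reduction, the sketch below verifies a candidate solution of Weighted Monotone 1-in-3-SAT and searches for one by exhaustive enumeration. It is only an illustration (exponential in the worst case, as expected for an NP-complete problem); the encoding of clauses as triples of variable names and the function names are assumptions.

```python
from itertools import combinations


def is_valid(true_vars, clauses, n):
    """Check that exactly n variables are set to True and that each
    clause contains exactly one variable set to True."""
    true_vars = set(true_vars)
    return len(true_vars) == n and all(
        len(true_vars & set(clause)) == 1 for clause in clauses
    )


def solve(variables, clauses, n):
    """Exhaustive search over all subsets of n variables set to True.
    Returns one valid set of True variables, or None if none exists."""
    for subset in combinations(variables, n):
        if is_valid(subset, clauses, n):
            return set(subset)
    return None


# Hypothetical instance with m = 4 variables and k = 2 clauses.
variables = ["x1", "x2", "x3", "x4"]
clauses = [("x1", "x2", "x3"), ("x2", "x3", "x4")]
```

On this instance, setting only x2 to True is valid for n = 1, setting x1 and x4 to True is valid for n = 2, and no valid assignment exists for n = 3. Verifying a candidate takes time linear in the size of the clause set.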
Given an instance I = (C, ℒ, n) of the Weighted Monotone 1-in-3-SAT problem, we compute, in polynomial time, a corresponding instance […]
First, the set of species is computed as follows: […]
The species tree S is: […]
The gene tree G is then: […]
Notice that for each species C_i, 1 ≤ i ≤ k, G contains exactly 3 leaves mapped to C_i and that for each species […]
The b-constraint (M, I) is defined as follows: we require that one of the three leaves mapped to C_i be labeled by 1 and that the remaining two leaves mapped to C_i be labeled by 0.
[…] that n of the m leaves mapped to T^i_s be labeled by 1 and that the remaining m − n leaves mapped to T^i_s be labeled by 0.
Finally, Cost is set to DL(G, S).

Lemma 4
The gene tree ⟨G, s_L⟩ computed in the reduction is in B_DL(⟨G, s_L⟩, S).
Proof Let G^M be a star tree on the leaves of G and let R^*_DL be the optimal DL-Reconciliation of G with S. Notice that R^*_DL contains m − 1 duplication nodes and (m − 3) * k losses and thus, the cost being unitary, C(R^*_DL) = m − 1 + (m − 3) * k.
We will now show that for any binary refinement G′ of the star tree G^M, if the optimal reconciliation of G′ with S contains fewer than (m − 3) * k losses, then it contains at least m − 1 + (m − 3) * k duplication nodes. Let R_DL = ⟨R, s_lca, e⟩ be the optimal DL-Reconciliation of G′ with S. Note that we consider here a non-compressed DL-Reconciliation. If the number of losses in R_DL is less than (m − 3) * k, then there must exist i (1 ≤ i ≤ k) such that there are fewer than m − 3 losses in the species […]
Let ℓ_0 be the number of losses in C_i in R_DL and let ℓ_s (1 ≤ s ≤ d) be the number of losses in T^i_s in R_DL. As exactly 3 leaves of G′ are mapped to C_i, there are 3 + ℓ_0 non-duplication nodes of R_DL mapped to C_i. There are thus at most 3 + ℓ_0 speciation nodes mapped to p(C_i) in R_DL, because a speciation node mapped to p(C_i) must have one child mapped to C_i (that child may be a duplication node mapped to C_i, but then this duplication node has at least two non-duplication descendant nodes mapped to C_i that are not children of a speciation node mapped to p(C_i)). Using the same reasoning, there are at most 3 + ℓ_0 + ℓ_1 speciation nodes mapped to p(T^i_1) in R_DL. The same reasoning can be applied to show that for each node x in {p(T^i_1), p(T^i_2), . . . , p(T^i_d)}, there are fewer than m speciation nodes of R_DL mapped to x because 3 + ∑_{s=0}^{d} ℓ_s < m. For 1 ≤ s ≤ d, as the m leaves of R mapped to T^i_s cannot all have a speciation node as a parent, there is at least one duplication node mapped to T^i_s in R_DL. Therefore, there are at least d = m − 1 + (m − 3) * k duplication nodes in R_DL and the cost of R_DL cannot be lower than the cost of R^*_DL.
Fig. 2 (1) A binary refinement ⟨G, s_L⟩ of the multifurcated tree of Fig. 1.(2) and the corresponding b-constraint labeling (M, I): I is the set of nodes indicated by crosses, and for each such node x, M(x) is the table represented at that node; (2) The b_L assignment leading to the optimal DLE-Reconciliation, also represented in Fig. 1.(6). Here, the compressed DLE-Reconciliation is illustrated, where the edge labeled P is the only one where an EGTL event is present.
If otherwise, for a binary refinement G′ of the star tree G^M, the optimal reconciliation of G′ with S contains at least (m − 3) * k losses, then its cost is at least m − 1 + (m − 3) * k because it contains at least m − 1 duplication nodes, as there are m leaves of G′ mapped to T^1_1. It thus cannot have a cost lower than C(R^*_DL).
We conclude that the gene tree ⟨G, s_L⟩ computed in the reduction is in B_DL(G, S).
We next show that I is a satisfiable instance of the Weighted Monotone 1-in-3-SAT problem if (Lemma 5) and only if (Lemma 6) its corresponding instance I ′ of the DL-DLE-BinL1 decision problem admits a DLE-Reconciliation of cost lower than or equal to Cost.

Lemma 5 Let I be a satisfiable instance of the Weighted Monotone 1-in-3-SAT problem. Then its corresponding instance I ′ of the DL-DLE-BinL1 decision problem admits a DLE-Reconciliation of cost lower than or equal to Cost.
Proof Let R DL = ⟨G, s lca , e⟩ be the optimal DL-Reconciliation of G with S. We will show that we can obtain a DLE-Reconciliation R DLE of cost lower than or equal to Cost from R DL by converting some duplication events into EGT events. Recall that because the costs are unitary, converting a duplication event into an EGT event does not change the cost of the reconciliation.
Let TA be a truth assignment with exactly n variables set to True satisfying C such that exactly one literal in each clause is set to True (we know that such truth assignment exists because I is a satisfiable instance).
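The property required of TA can be made concrete with a minimal Python sketch. It only checks a candidate assignment and is not part of the reduction; the function name `is_weighted_1in3_solution` and the encoding of a clause as a triple of variable indices are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

# A monotone clause is a triple of indices j of the (positive) variables l_j.
Clause = Tuple[int, int, int]


def is_weighted_1in3_solution(
    clauses: Iterable[Clause], assignment: Dict[int, bool], n: int
) -> bool:
    """Return True if `assignment` sets exactly n variables to True and
    makes exactly one variable True in every clause."""
    if sum(assignment.values()) != n:
        return False
    return all(sum(assignment[j] for j in clause) == 1 for clause in clauses)


# Illustrative instance: variables l_1..l_4, clauses (l_1, l_2, l_3) and (l_1, l_3, l_4).
clauses = [(1, 2, 3), (1, 3, 4)]
good = {1: True, 2: False, 3: False, 4: False}
bad = {1: False, 2: True, 3: False, 4: False}  # second clause has no True variable
```

Checking a candidate takes time linear in the size of the formula, which is all that membership in NP requires of the source problem.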
We now construct the b-labeling b (and b L ) and the mappings e V and e E as follows: For all j, 1 ≤ j ≤ m , such that ℓ j is True (resp. False) in TA, we set b(x) = 1 (resp. b(x) = 0 ) for each node x of the subtree S j . Let j * be the smallest index such that ℓ j * is set to False in TA (this index exists, as a truth assignment setting all variables to True cannot be a solution to the Weighted Monotone 1-in-3-SAT problem). If j * > 2 we set b(x) = 1 for each node x on the path from the parent of r(S 1 ) to the parent of r(S j * −1 ) and we set b(y) = 0 for each node y on the path from the parent of r(S j * ) to r(G). Else (when j * ∈ {1, 2} ), we set b(x) = 0 for each node x on the path from the parent of r(S 1 ) to r(G).
There are no EGTL events in the subtrees S j ( 1 ≤ j ≤ m ) because all nodes in a given subtree S j have the same b-label. Notice that all nodes on the path from the parent of r(S 1 ) to r(G) are duplication nodes in R DL and we can convert them to EGT events in R DLE . If j * ∈ {1, 2} , then, for 1 ≤ j ≤ m , if ℓ j is set to True in TA, we set e V (parent of r(S j )) = EGT (which is a transfer from 0 to 1). Otherwise (when j * > 2 ), we set e V (parent of r(S j * )) = EGT (which is a transfer from 0 to 1) and for j * + 1 ≤ j ≤ m , if ℓ j is set to True in TA, we set e V (parent of r(S j )) = EGT (which is a transfer from 0 to 1).
In both cases, it is easy to see that this mapping is valid and that no EGTL events are required in R DLE .
As there are no EGTL events in R DLE , the cost of R DLE is DL(G, S) and thus C(R DLE ) ≤ Cost.
For each leaf x of G, we set b L (x) = b(x) . As exactly n variables are set to True in TA and as one variable per clause is set to True in TA, we know, by construction, that for each species C i , 1 ≤ i ≤ k , one of the three leaves mapped to C i is labeled by 1 and the remaining two leaves mapped to C i are labeled by 0 and that for each species
We then obtain a DLE-Reconciliation R DLE = ⟨G, s lca , b, e V , e E⟩ of ⟨G, s L , b L⟩ where b L is a b-labeling consistent with (M, I) for which C(R DLE ) ≤ Cost and we conclude that the instance I ′ of the DL-DLE-BinL1 decision problem admits a DLE-Reconciliation of cost lower than or equal to Cost.

Lemma 6 Let I be an unsatisfiable instance of the Weighted Monotone 1-in-3-SAT problem. Then its corresponding instance I ′ of the DL-DLE-BinL1 decision problem does not admit a DLE-Reconciliation of cost equal or lower than Cost.
Proof By contradiction, let us suppose that for an unsatisfiable instance I of the Weighted Monotone 1-in-3-SAT problem, its corresponding instance I ′ of the DL-DLE-BinL1 decision problem does admit an optimal DLE-Reconciliation R DLE of cost equal or lower than Cost. In that case, R DLE does not contain EGTL events as otherwise its cost would be greater than DL(G, S) = Cost by Lemma 1. As there are no duplication nodes in the DL-Reconciliation of the subtrees S j ( 1 ≤ j ≤ m ) with S, we know from Lemma 1 that no EGT events occur in those subtrees in R DLE . Therefore, by definition of a DLE-Reconciliation, for 1 ≤ j ≤ m , the nodes in S j have the same b-label.
We now define a truth assignment TA as follows: for all 1 ≤ j ≤ m , set the variable ℓ j to True if the b-label of the nodes in S j is 1, and set the variable ℓ j to False otherwise.
For each species C i (corresponding to the clause C i ), 1 ≤ i ≤ k , we know by construction that one of the three leaves mapped to C i is labeled by 1 and the remaining two leaves mapped to C i are labeled by 0 in G. Therefore the truth assignment TA satisfies C and for each clause C i , one literal is set to True and two literals are set to False in TA. We know that exactly n variables are set to True in TA, as exactly n subtrees S i have their nodes labeled by 1 because of the b-constraint (M, I) requiring exactly n of the m leaves mapped to T 1 1 to be labeled by 1.
I is then a satisfiable instance which is a contradiction. We thus conclude that if I is an unsatisfiable instance of the Weighted Monotone 1-in-3-SAT problem, then its corresponding instance I ′ of the DL-DLE-BinL1 decision problem does not admit a DLE-Reconciliation of cost equal or lower than Cost.
Since Weighted Monotone 1-in-3-SAT is NP-complete, Lemmas 5 and 6 lead to the following results.

A tractable version of the DL-DLE-BinL1 problem
Given σ ∈ Σ , the multiplicity M ⟨G,s L⟩ (σ ) of σ in ⟨G, s L⟩ is the cardinality of the set {x ∈ L(G) : s L (x) = σ } . The multiplicity factor M ⟨G,s L⟩ is the constant defined as max_{σ ∈ Σ} M ⟨G,s L⟩ (σ ).
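As a small illustration of this definition, the multiplicity of each species and the multiplicity factor can be computed directly from the leaf labeling. The sketch below assumes the labeling s L is given as a dictionary from leaf names to species names; the function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Hashable, Mapping


def multiplicities(species_of_leaf: Mapping[Hashable, Hashable]) -> Counter:
    """Count, for each species sigma, the leaves x with s_L(x) = sigma."""
    return Counter(species_of_leaf.values())


def multiplicity_factor(species_of_leaf: Mapping[Hashable, Hashable]) -> int:
    """Largest multiplicity over all species (0 for an empty labeling)."""
    counts = multiplicities(species_of_leaf)
    return max(counts.values(), default=0)


# Illustrative leaf labeling s_L: five gene-tree leaves over three species.
s_L = {"g1": "A", "g2": "A", "g3": "B", "g4": "C", "g5": "A"}
```

Species that label no leaf have multiplicity 0, so taking the maximum over the species that occur in the tree gives the same value as the maximum over all of Σ.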
The two following lemmas make the link between the maximum number of non-loss nodes in an optimal DL-Reconciliation R DL of ⟨G, s L⟩ ∈ B DL (⟨G M , s M L⟩, S) mapped to a given node in S, and the multiplicity factor M ⟨G,s L⟩ . We will then show that the DL-DLE-BinL1 Problem is fixed-parameter tractable with respect to the multiplicity factor M ⟨G,s L⟩ .
Consider Algorithm 1 above. We show that it transforms R DL into another DL-Reconciliation R ′ DL of another binary refinement of G M with one less speciation node mapped to σ than R DL and such that R ′ DL has a lower cost than R DL . This contradicts the fact that R DL is a reconciliation of a tree ⟨G, s L⟩ belonging to B DL (⟨G M , s M L⟩, S).
It is straightforward to see that this procedure leads to a valid DL-Reconciliation of a binary refinement of G M as all it does is replace the subtree R[x 1 ] by a loss in σ and place all the leaves belonging to R[x 1 ] elsewhere in R in a position respecting Definition 2 (because the procedure only replaces losses in R by subtrees of R[x 1 ] whose roots are mapped to the same species as the loss they replace). In fact, every non-loss leaf of R[x 1 ] belongs to a species which, by the hypothesis, cannot be the species label of more than k − 1 other non-loss leaves of R DL , i.e. should be missing in at least one of the all separated subtrees
This procedure never increases the number of duplication nodes in the reconciliation as it only replaces losses in R by subtrees of R[x 1 ] whose root is mapped to the same species as the loss it replaces. It adds one new loss to the DL-Reconciliation as the subtree R[
Proof As noted in the proof of Lemma 1, in an optimal DL-Reconciliation R , a duplication node cannot have a loss as a child. It follows from that fact and from the definition of a DL-Reconciliation that for a given species σ in V (S) \ L(S) (respectively σ ∈ L(S) ), the number of speciation nodes (respectively non-loss leaves) in R mapped to σ is at least one more than the number of duplication nodes mapped to σ and the number of non-loss leaves (respectively speciation nodes) mapped to σ is 0. By Lemma 7, we know that for any optimal DL-Reconciliation R DL of a tree ⟨G, s L⟩ ∈ B DL (⟨G M , s M L⟩, S) with S, the number of speciation nodes mapped to a given species is at most M ⟨G,s L⟩ (and, by definition, the number of non-loss leaves mapped to a given species is at most M ⟨G,s L⟩ ). Therefore the number of duplication nodes mapped to a given species is at most M ⟨G,s L⟩ − 1 . Thus, there are at most 2M ⟨G,s L⟩ − 1 non-loss nodes of R DL that are mapped to any given node in S.

Lemma 9
Let G M be a star-tree. For any optimal DL-Reconciliation R DL of a tree ⟨G, s L⟩ ∈ B DL (⟨G M , s M L⟩, S) with S, there are at most 3M ⟨G,s L⟩ − 1 nodes of R DL that are mapped to any given node in S.
Proof From Lemma 8, for any optimal DL-Reconciliation R DL of a tree ⟨G, s L⟩ ∈ B DL (⟨G M , s M L⟩, S) with S, the number of non-loss nodes mapped to a given species x is at most 2M ⟨G,s L⟩ − 1 . Moreover, in R DL , the parent of a loss node mapped to x is a speciation node mapped to p(x). By Lemma 7, we know that the number of speciation nodes mapped to p(x) is at most M ⟨G,s L⟩ . Therefore, the number of nodes in R DL mapped to x is at most 3M ⟨G,s L⟩ − 1 .
Proof We can do so by using Algorithm 1 in [2]. Note that in that paper, EGTcopy holds for an EGT event and EGTcut holds for an EGTL event.
Let R DL = ⟨R, s lca , e⟩ be a non-compressed DL-Reconciliation of a tree ⟨G, s L⟩ with S. For the proof of the next theorem, given a node σ of S, we denote by b[σ ] a given b-labeling for all non-loss nodes of R mapped to σ . Note that if there are k such nodes, then the number of possible b[σ ] labelings is 2^k . For a node σ of S, we define MaxTrees(σ ) to be the set of "maximum" subtrees of R whose roots are mapped to σ , i.e. such that the parents of these roots are not mapped to σ . For a node σ ∈ V (S) \ L(S) , we define CutMaxTrees(σ ) as the set of subtrees obtained from MaxTrees(σ ) by removing from the subtrees all strict descendants of the roots of the trees in MaxTrees(σ l ) and MaxTrees(σ r ) . We also define, for any labeling b

Theorem 2 The DL-DLE-BinL1 decision problem is fixed-parameter tractable with respect to the multiplicity factor M G,s L .
Proof Here, we consider non-compressed reconciliations.
We can solve the DL-DLE-BinL1 decision problem using Algorithm 3.
We show by induction that Algorithm 3 computes the correct cost CostMaxTrees(σ , b[σ ]) for a given node σ in S and all possible b-labelings b[σ ].
If the node σ is a leaf of S, then Algorithm 3 computes the correct CostMaxTrees(σ , b[σ ]) by definition.
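The b-labelings b[σ ] over which this induction ranges can be enumerated explicitly, which shows where the dependence on the multiplicity factor comes from. The sketch below is only an illustration and is not Algorithm 3: it lists the 2^k labelings of k non-loss nodes and evaluates the bound 2^(2M − 1) that follows from having at most 2M − 1 non-loss nodes mapped to a species node. The function names are hypothetical.

```python
from itertools import product
from typing import Iterator, Sequence, Tuple


def labelings(nodes: Sequence[str]) -> Iterator[Tuple[Tuple[str, int], ...]]:
    """Yield every 0/1 labeling b[sigma] of the given non-loss nodes."""
    for bits in product((0, 1), repeat=len(nodes)):
        yield tuple(zip(nodes, bits))


def max_labelings_per_species_node(multiplicity_factor: int) -> int:
    """Upper bound 2^(2M - 1) on the number of labelings b[sigma], assuming
    at most 2M - 1 non-loss nodes are mapped to the species node sigma."""
    if multiplicity_factor < 1:
        raise ValueError("the multiplicity factor must be at least 1")
    return 2 ** (2 * multiplicity_factor - 1)
```

For a multiplicity factor of 2, for instance, at most 3 non-loss nodes are mapped to a species node, so at most 8 labelings have to be considered there.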
We may suppose now by the induction hypothesis that Algorithm 3 computes the correct cost for all possible b-labelings for the two children σ l and σ r of a given internal node σ of S. Let us show that Algorithm 3 is correct for σ . By the hypothesis, the algorithm correctly computes
Finally, the next theorem states that, in contrast to DL-DLE-BinL1 and DL-DLE-BinL, the general problems DLE-BinL1 and DLE-BinL remain NP-complete even if the multiplicity factor of ⟨G, s L⟩ is restricted to two.

Theorem 3 The DLE-BinL1 decision problem is NP-complete, even for M ⟨G,s L⟩ = 2.
The proof, given in Appendix, uses a reduction from the Monotone not-all-equal 3-satisfiability problem. The next corollary follows.

The one-direction DLE-reconciliation problem
As endosymbiotic transfer events often move genes from the mitochondrial to the nuclear genome, and rarely in the opposite direction, we address the specific case where transfers are only allowed in one direction, i.e. when b-labels can only switch from 0 to 1, or only from 1 to 0. In the following definition, with no loss of generality, we assume transitions from 0 to 1.

Definition 5 (One-direction DLE-Reconciliation) Let ⟨G, s L , b L⟩ be a rooted binary gene tree. A One-direction DLE-Reconciliation for ⟨G, s L , b L⟩ is a DLE-Reconciliation ⟨G, s lca , b, e V , e E⟩ verifying: for each edge (x, y) of G, b(x) ≤ b(y).
One-DLE-BinL Problem:
Input: A binary tree ⟨G, s L⟩, a b-Constraint (M, I) and a species tree S;
Output: An optimal One-direction DLE-Reconciliation ⟨G, s lca , b, e V , e E⟩ of ⟨G, s L , b L⟩ with S where b L is a b-labeling consistent with (M, I).
We also define, in a similar way as before, the One-DLE-BinL1 problem where I is restricted to the root of G, and the corresponding decision problems. We next show that even this very restricted version of our initial problem is intractable. Moreover, the One-DL-DLE-BinL (resp. One-DL-DLE-BinL1) problem is defined as the One-DLE-BinL (resp. One-DLE-BinL1) problem with the additional restriction that the binary tree given as input is in B DL (⟨G M , s M L⟩, S). We show that One-DL-DLE-BinL1 and One-DL-DLE-BinL are NP-hard but fixed-parameter tractable with respect to the multiplicity factor, while One-DLE-BinL1 and One-DLE-BinL are NP-hard even with a multiplicity factor of two.

Theorem 4 The One-DL-DLE-BinL decision problem is NP-complete.
Proof The proof for NP-completeness of One-DL-DLE-BinL1 is the same as that of Theorem 1, as the DLE-Reconciliation in the proof verifies the One-direction condition. The NP-completeness of One-DL-DLE-BinL follows.

Theorem 5
The One-DL-DLE-BinL1 problem is fixed-parameter tractable with respect to the multiplicity factor M ⟨G,s L⟩ .
Proof Note that the proof of Lemma 1 holds for a One-direction DLE-Reconciliation, i.e. an optimal One-direction DLE-Reconciliation can be obtained from the optimal DL-Reconciliation. Therefore, we can solve the One-DL-DLE-BinL1 Problem using the algorithm in the proof of Theorem 2, just giving an infinite cost for a transition from 1 to 0.
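The idea of forbidding one direction through an infinite cost can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm of Theorem 2: the tree is assumed to be given as a list of (parent, child) edges with a dictionary `b` of 0/1 labels, a label change is charged a single hypothetical `unit_cost`, and the function names are invented for the example.

```python
import math
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]  # (parent, child)


def is_one_direction(edges: Iterable[Edge], b: Dict[str, int]) -> bool:
    """True if no edge (x, y) goes from b-label 1 to b-label 0."""
    return all(b[x] <= b[y] for x, y in edges)


def transition_cost(parent_label: int, child_label: int, unit_cost: float = 1.0) -> float:
    """Cost of a label change along an edge; the forbidden 1 -> 0
    direction gets an infinite cost so that no optimal solution uses it."""
    if parent_label == child_label:
        return 0.0
    if parent_label == 1 and child_label == 0:
        return math.inf
    return unit_cost


# Illustrative tree: root r with children u and v, and v with leaves a and c.
edges = [("r", "u"), ("r", "v"), ("v", "a"), ("v", "c")]
ok_labels = {"r": 0, "u": 0, "v": 1, "a": 1, "c": 1}
bad_labels = {"r": 0, "u": 0, "v": 1, "a": 0, "c": 1}  # edge (v, a) goes 1 -> 0
```

The check visits each edge once, so the one-direction condition can be verified in time linear in the size of the tree.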
It follows from Theorem 4 that One-DLE-BinL is NP-complete. However, as for DLE-BinL1 and DLE-BinL, One-DLE-BinL1 and One-DLE-BinL remain NP-complete even if the multiplicity factor of ⟨G, s L⟩ is restricted to two. The proof is given in Appendix.

Theorem 6
The One-DLE-BinL1 and One-DLE-BinL decision problems are NP-complete, even for M ⟨G,s L⟩ = 2.

A general algorithm for the DLE-BinL problem
A natural heuristic for the DLE-BinL problem for ⟨G, s L⟩, where G is a binary resolution of an initial multifurcated tree with initial polytomies reflected by a b-Constraint (M, I), would be to solve each polytomy, i.e. each subtree rooted at a node x of I, individually, in a post-order traversal of the tree. In fact, this strategy leads to an exact algorithm for the DL Non-binary Reconciliation Problem [5]. However, in the case of DLE-Reconciliation, the b-labeling of internal nodes introduces a dependency between polytomies, which prevents the heuristic from being exact in general, i.e. for an arbitrary cost of operations. In this section, we present the general heuristic (Algorithm 4) and show that it is exact in the case of a unitary cost of operations.
Algorithm 4 traverses the tree G in post-order and each time it encounters a node x ∈ I , it "solves" the corresponding subtree G[x] and replaces it by a single leaf, with an appropriate b-label.
Once the tree G has been completely traversed, the subtrees are put back in the tree. Notice that on line 13, the algorithm adds a new intermediate species to
Proof The proof is by induction on the number of nodes x ∈ V (G) such that x ∈ I.
Notice that the DLE-Reconciliation G, s lca , b, e V , e E returned by Algorithm 4 is such that b is a b-labeling consistent with (M, I) by construction.
If there is only one node x ∈ V (G) such that x ∈ I , then this node x is the root of G by definition. The algorithm then returns an optimal solution, as we assume that we can solve DLEBinLR(⟨G, s L⟩, M ′ (r(G)), S, i) (where M ′ (r(G)) = M(r(G)) ) for i ∈ {0, 1}.
If there is more than one node x ∈ V (G) such that x ∈ I , then the root of G is in I by definition. By induction, we may assume that for each node x ∈ V (G) \ r(G) such that x ∈ I , the reconciliation of G[x] computed by the algorithm is exact. For each of those subtrees G[x], we then know the possible b-label(s) at the root leading to an optimal reconciliation of G[x] and the corresponding optimal reconciliation of G[x]. We now index the elements of I \ r(G) from 1 to |I| − 1 . For all 1 ≤ j ≤ |I| − 1 , there are then two cases for x j ∈ I \ r(G) :
Case 1: G[x j ] is such that both b(x j ) = 0 and b(x j ) = 1 can lead to an optimal reconciliation of G[x j ] . In that case, Algorithm 4 will remove G[x j ] from G and replace it by a new leaf without a b-label. It solves G[x j ] separately and then replaces the new leaf in G by the solved G[x j ] (after the rest of G is solved). G[x j ] can be solved separately in that case, because regardless of the b-label of the parent of G[x j ] in an optimal reconciliation of (the rest of) G we can obtain an optimal reconciliation of G[x j ] with r(G[x j ]) having the same b-label as its parent (and thus we can obtain an optimal solution to the problem by putting the solved G[x j ] with r(G[x j ]) having the same b-label as its parent back in G).
Case 2: G[x j ] is such that only b(x j ) = i j , with i j ∈ {0, 1} , can lead to an optimal reconciliation of G[x j ] . In that case, Algorithm 4 will remove G[x j ] from G and replace it by a new leaf with b-label i j .
Then, Algorithm 4 solves DLEBinLR(⟨G ′ , s⟩, M ′ (r(G)), S, k) ( k ∈ {0, 1} ) where G ′ is the tree obtained after all the x j are visited by the algorithm. By construction, it will return the solution of lowest cost such that b(x j ) = i j , for all x j belonging to Case 2. Let us show that this solution is optimal. By contradiction, suppose that there is x j ∈ I \ r(G) ( x j belonging to Case 2) such that there is no optimal solution of the problem for which b(x j ) = i j . Then, the optimal solution R * of the problem is such that b(x j ) ≠ i j . In R * , if we set b(x j ) = i j and replace the reconciliation of the subtree G[x j ] by the optimal reconciliation of G[x j ] (that we can obtain because b(x j ) = i j ), we obtain a new solution R ′ of the problem with at most one more EGTL event (on the edge (p(x j ), x j ) ) and such that the reconciliation of G[x j ] in R ′ has a strictly lower cost than the reconciliation of G[x j ] in R * . There is then at least one less event in the reconciliation of G[x j ] in R ′ and as the costs are unitary, the solution R ′ is such that C(R ′ ) ≤ C(R * ) and thus R ′ is optimal. Contradiction. We then conclude that there is an optimal solution of the problem for which b(x j ) = i j .
Thus, Algorithm 4 returns an optimal solution for the input ( G, s L , s lca , (M, I), S).
We conclude, by induction, that the solution returned by Algorithm 4 is optimal.

An exact algorithm for the one-species version of the DLE-BinL1 problem
We consider a restriction of the DLE-BinL1 Problem where genes are specific to a single genome (the mitochondrial or nuclear genome) in all but one species. We call it the DLE-BinL1-OneSpecies problem. In its simplest version where a single species is present, the problem reduces to assigning a multiset of two labels (a given number of 0s and a given number of 1s) to the leaves of a tree-shape (i.e. a tree with no leaf labels), in a way minimizing 0-1 transitions in the tree. Similar problems on assigning leaves to tree-shapes or to multilabeled trees (MUL-trees) have been considered in the context of other tree distances (Robinson-Foulds distance, path distance, maximum agreement subtree), most of them being NP-complete [12,13]. Here, we present an exact polynomial-time algorithm for the DLE-BinL1-OneSpecies Problem.
Let σ ∈ Σ be the only species for which the genes belonging to it are not specific to a single genome. We will call the leaves ℓ ∈ L(G) for which s(ℓ) = σ free leaves and the leaves ℓ ∈ L(G) for which s(ℓ) ≠ σ fixed leaves. For a fixed leaf ℓ , b(ℓ) is fixed and known in advance, as all leaves whose species label is s(ℓ) have the same b-label which is known from the matrix M. The DLE-BinL1-OneSpecies problem is then reduced to finding an optimal DLE-Reconciliation for which exactly k free leaves are labeled by 0, where k = M(r(G))[σ , 0] (the (σ , 0) entry of M(r(G))).
Let R DL = ⟨G, s lca , e⟩ be the optimal DL-Reconciliation for ⟨G, s L⟩ . From Lemma 1, any optimal DLE-Reconciliation R DLE = ⟨G, s lca , b, e V , e E⟩ with exactly k free leaves labeled by 0 can be obtained from R DL by converting some duplications into EGTs and adding EGTL events, i.e. a P/A labeling on edges. We define minCostTransfer(⟨G, s lca , b, e V , e E⟩) = |e V EGT | * (τ − δ) + |e E | * ρ .
Then recall from "Preliminaries, evolutionary model and definitions" section that, by construction of R DLE , we have: C(R DLE ) = C(R DL ) + minCostTransfer(⟨G, s lca , b, e V , e E⟩). The problem thus reduces to minimizing minCostTransfer(⟨G, s lca , b, e V , e E⟩).
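The quantity being minimized is a simple weighted count, which the following sketch evaluates. It assumes that τ, δ and ρ are read as the costs of an EGT, a duplication and an EGTL event respectively, that |e V EGT| counts the nodes relabeled as EGT, and that |e E| counts the edges carrying an EGTL event; the function and parameter names are hypothetical.

```python
def min_cost_transfer(num_egt: int, num_egtl: int,
                      tau: float, delta: float, rho: float) -> float:
    """Extra cost over the DL-Reconciliation: each duplication converted
    into an EGT costs (tau - delta), each EGTL event costs rho."""
    return num_egt * (tau - delta) + num_egtl * rho
```

With unitary costs (τ = δ = ρ = 1) the first term vanishes, so only the EGTL events contribute, in line with the remark that converting a duplication into an EGT does not change the cost in that setting.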
We will need to consider the two possible b-labelings i ∈ {0, 1} for the root of G. We therefore denote by minCostTransfer( G, s lca , e , i, k) the minCostTransfer function for an optimal DLE-Reconciliation R DLE with exactly k free leaves labeled by 0 and with the additional constraint that b(r(G)) = i.
We are now ready to present Algorithm 5. It proceeds in two steps: (1) a bottom-up step (Algorithm 6) in which we assign an array of size 2 × (k + 1) to each node x of G where the (i, j)th entry equals minCostTransfer(⟨G[x], s lca , e⟩, i, j) ; (2) a top-down step (not given in pseudo-code) in which the algorithm assigns the b-labeling of nodes and locates the EGT and EGTL events in the optimal solution. See Fig. 3 for an execution of Algorithm 5.
Proof Assume that, for each entry of x.array of each internal node x, Algorithm 6 keeps in memory pointers to the entries of the arrays of the children of x from which the value of the entry was obtained.
Once the optimal arrays are computed for all nodes, the optimal solution is easily reconstructed from the entry min(r(G).array(0, k), r(G).array (1, k)) by following the pointers from the root to the leaves.
The key point is therefore showing that the arrays computed by Algorithm 6 are exact.
If x is a leaf (either free or fixed), it is easy to see that x.array is correct. Now, if x is an internal node, we may assume that x l .array and x r .array are correct by the induction hypothesis. By contradiction, let us assume that there is (i, j) such that x.array(i, j) ≠ minCostTransfer(⟨G[x], s lca , e⟩, i, j) . Let R be the optimal DLE-Reconciliation leading to minCostTransfer(⟨G[x], s lca , e⟩, i, j) . Then, in R , b(x) = i , b(x l ) = ℓ 1 where ℓ 1 ∈ {0, 1} and b(x r ) = ℓ 2 where ℓ 2 ∈ {0, 1} . Also, as there are j free leaves labeled by 0 under x, the sum of the numbers of free leaves labeled by 0 under x l and x r must be equal to j. If the genome labels of the children of x are not the same as i, x is converted into an EGT event if x is a duplication node in the DL-Reconciliation (and possibly an EGTL event is added) and if x is not a duplication node then some EGTL events may be added on the edges between x and its children. As the algorithm considers all possibilities of genome labels for x l and x r and all possibilities of number of free leaves labeled by 0 under x r and x l leading to j free leaves under x labeled by 0 (and considers the optimal assignment of EGT and EGTL events for the transfer(s) needed from x to its children), the particular possibility leading to R will be considered and then x.array(i, j) = minCostTransfer(⟨G[x], s lca , e⟩, i, j) . This is a contradiction. Thus, there is no such (i, j) and x.array is exact.
We conclude, by induction, that the arrays computed by Algorithm 6 are exact. Once all the arrays are computed, the algorithm finds the optimal assignment of the internal nodes with a preorder traversal of G in time O(n). We conclude that the time complexity of Algorithm 5 is O(nk^2) .
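The bottom-up step can be illustrated on the simplest variant described at the start of this section, where labels are assigned to the leaves of a tree so as to minimize 0-1 transitions. The Python sketch below is a simplification and not Algorithm 6: it charges one unit for every edge whose endpoints carry different labels and ignores the distinction between EGT and EGTL events. The `Node` class and the function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

INF = math.inf


@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    fixed_label: Optional[int] = None  # 0 or 1 for a fixed leaf, None for a free leaf

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def table(x: Node, k: int) -> List[List[float]]:
    """arr[i][j] = fewest label changes in the subtree of x when b(x) = i
    and exactly j free leaves below x are labeled 0 (INF if impossible)."""
    arr = [[INF] * (k + 1) for _ in range(2)]
    if x.is_leaf():
        if x.fixed_label is None:      # free leaf: label 0 uses up one zero
            if k >= 1:
                arr[0][1] = 0
            arr[1][0] = 0
        else:                          # fixed leaf: label known, no free zero used
            arr[x.fixed_label][0] = 0
        return arr
    left, right = table(x.left, k), table(x.right, k)
    for i in (0, 1):
        for j in range(k + 1):
            for jl in range(j + 1):    # split the j zeros between the two children
                best_l = min(left[a][jl] + (a != i) for a in (0, 1))
                best_r = min(right[c][j - jl] + (c != i) for c in (0, 1))
                arr[i][j] = min(arr[i][j], best_l + best_r)
    return arr


def min_changes(root: Node, k: int) -> float:
    """Fewest 0/1 changes over all labelings with exactly k free leaves labeled 0."""
    arr = table(root, k)
    return min(arr[0][k], arr[1][k])


# Illustrative tree ((free, free), (free, fixed leaf labeled 1)).
example = Node(Node(Node(), Node()), Node(Node(), Node(fixed_label=1)))
```

Each internal node fills 2 × (k + 1) entries and tries at most k + 1 splits per entry, which gives the same O(nk^2) order as stated above; recovering the labeling itself would additionally require storing, for each entry, the choice that produced it.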

Conclusion
In this paper, we present the first method for DLE-Reconciliation, that is a reconciliation accounting for duplications, losses, but also EGTs, for a multifurcated gene tree. It is a natural extension of the DL-Reconciliation of a multifurcated tree, where we first consider a solution for this problem, i.e. an optimal DL-Reconciliation, and then appropriately assign the binary b-labeling (0/1 for mitochondrial/nuclear) to the nodes of the tree in a way minimizing a total DLE (Duplications, Losses and EGTs) cost.
We show that the optimal b-labeling assignment step is NP-complete even if the gene tree in input is a binary refinement of a star-tree, and even when genes are present in only two copies in each species. Moreover, the problem remains NP-complete when the transfers are allowed in a single direction (e.g. only from 0 to 1) and even if the gene tree in input is an optimal resolution for the DL-Reconciliation. In this latter case, the problem is shown fixed-parameter tractable with respect to the gene tree's multiplicity factor. We then present a greedy heuristic for the general version of the problem solving each polytomy independently in a bottom-up traversal of the tree. This heuristic is shown to be exact for a unitary cost of operations. Moreover, we give a polynomial-time algorithm for the resolution of a single polytomy in the case where genes are specific to a single genome in all but one species. We did not explore the case where genes are specific to a single genome in all but a fixed number of species, but we believe Algorithm 5 can be extended to solve this problem in polynomial time.
From a biological point of view, the next step will be to apply our method to the orthologous mitochondrial protein-coding genes (MitoCOGs) dataset [2,10].
From a theoretical and algorithmic point of view, many open questions remain. Apart from the fact that a heuristic combining accuracy and time-efficiency should be developed for both the DLE-BinL and DLE-BinL1 problems in the general case, a more fundamental question is whether an exact one-step method, considering all the
For 1 ≤ j ≤ m , let L ′ j be a gene tree which is species label isomorphic to L j . For 1 ≤ j ≤ m , let U j be the tree computed as follows:
The gene tree G is then:
Notice that for each species s ∈ Σ , G contains exactly 2 leaves mapped to s and thus M ⟨G,s L⟩ = 2.
We set M(r(G)) equal to a matrix of ones of size |Σ| × 2 (meaning that for each pair of leaves mapped to a given species s, we require one leaf to have a b-label 0 and the other to have a b-label 1). Also recall that I = {r(G)} . Finally, Cost is set to DL(G, S) + k.
We next show that I is a satisfiable instance of the Monotone NAE3SAT problem if (Lemma 12) and only if (Lemma 13) its corresponding instance I ′ of DLE-BinL1 decision problem admits a DLE-Reconciliation of cost lower than or equal to Cost.

Lemma 11
Let I be an instance of the Monotone NAE-3SAT problem. For its corresponding instance I ′ of DLE-BinL1 decision problem, the optimal DLE-Reconciliation R DLE is such that there is at least 1 EGTL event in each subtree T i of G (i.e. e E (x, y) = P for an edge (x, y) of T i ) for 1 ≤ i ≤ k .
Proof For the optimal DLE-Reconciliation R DLE , for each clause C i = (x ∨ y ∨ z) ∈ C , 1 ≤ i ≤ k , for any b-labeling b L consistent with (M, I), there will be at least one EGTL event in the three following subtrees of T i (regardless of the labeling b of the internal nodes of these subtrees):
This is the case because there are no duplication nodes in the DL reconciliation of these subtrees with S (so no EGT events can occur in these subtrees in R DLE by Lemma 1) and we know that at least one of these subtrees will not have all its leaves labeled by the same genome label (because two leaves with the same species label cannot have the same genome label by construction of the instance) so at least one EGTL will be required.

Lemma 12 Let I be an unsatisfiable instance of the Monotone NAE3SAT problem. Then its corresponding instance I ′ of DLE-BinL1 decision problem does not admit a DLE-Reconciliation of cost lower than or equal to Cost.
Proof By contradiction, let us suppose that for an unsatisfiable instance I of the Monotone NAE3SAT problem, its corresponding instance I ′ of the DLE-BinL1 decision problem does admit a DLE-Reconciliation of cost equal or lower than Cost. Let's consider the optimal DLE-Reconciliation R DLE . R DLE is optimal and thus C(R DLE ) ≤ DL(G, S) + k as I ′ does admit a DLE-Reconciliation of cost equal or lower than Cost = DL(G, S) + k . By Lemma 11, R DLE is such that there is at least 1 EGTL event in each subtree T i of G for 1 ≤ i ≤ k . There is then at least k EGTL events in the reconciliation R DLE . As the cost of R DLE is equal to DL(G, S) plus the number of EGTL events in R DLE (from Lemma 4 in [2]), C(R DLE ) must be higher than or equal to DL(G, S) + k and we conclude that C(R DLE ) = DL(G, S) + k . Thus, there is exactly one EGTL event in each subtree T i of G for 1 ≤ i ≤ k and no EGTL event elsewhere in the tree as otherwise C(R DLE ) would be higher than DL(G, S) + k . In particular, there is no EGTL event in the subtrees U j , 1 ≤ j ≤ m , and we can conclude that all nodes in the subtree L ′ j , 1 ≤ j ≤ m , have the same genome label (there is no EGT event in the subtree L ′ j as there is no duplication in the DL-Reconciliation of L ′ j with S).
We now define a truth assignment TA as follows: for all 1 ≤ j ≤ m , set the variable ℓ j to True if the genome label of the nodes in L ′ j is 1, and set the variable ℓ j to False otherwise. We now show that TA satisfies I . For each clause C i = (x ∨ y ∨ z) ∈ C , 1 ≤ i ≤ k , we need to show that x, y and z are not all equal to each other. Let us suppose by contradiction that this is false, and that there exists a clause C i = (x ∨ y ∨ z) ∈ C such that x, y and z are all equal to each other. Then, by construction, the genome labels of the leaves x i , y i and z i in the corresponding subtrees T i are all equal to each other. Then, there is at least 2 EGTL events in T i , as at least two of the following three subtrees of T i will not have all their leaves labeled by the same genome label and there are no EGT events in those subtrees (by construction) because there are no duplication node in the DL reconciliation of these subtrees with S: This is a contradiction, as there must be exactly one EGTL event in the subtree T i . We then conclude that for each clause C i = (x ∨ y ∨ z) ∈ C , 1 ≤ i ≤ k , x, y and z are not all equal to each other. Thus, the truth assignment TA satisfies I , and we conclude by contradiction that if I is an unsatisfiable instance of the Monotone NAE3SAT problem, then its corresponding instance I ′ of the DLE-BinL1 decision problem does not admit a DLE-Reconciliation of cost equal or lower than Cost.

Lemma 13
Let I be a satisfiable instance of the Monotone NAE3SAT problem. Then its corresponding instance I ′ of DLE-BinL1 decision problem admits a DLE-Reconciliation of cost lower than or equal to Cost.
Proof Let R DL = ⟨G, s lca , e⟩ be the optimal DL-Reconciliation of G with S. We recall that, by definition, C(R DL ) = DL(G, S) . We will show that we can obtain a DLE-Reconciliation R DLE of cost lower than or equal to Cost from R DL by converting some duplication events into EGT events and by adding EGTL events. Notice that because the costs are unitary, converting a duplication event into an EGT event does not change the cost of the reconciliation. Thus, the cost of R DLE is DL(G, S) plus the number of EGTL events in R DLE .
Let TA be a truth assignment satisfying C such that the values in each clause are not all equal to each other (we know that such truth assignment exists because I is a satisfiable instance).
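The not-all-equal condition on TA can be checked clause by clause, as in the minimal sketch below. It is only meant to make the condition concrete; the function name `is_nae_solution` and the encoding of a monotone clause as a triple of variable indices are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

# A monotone clause is a triple of indices j of the (positive) variables l_j.
Clause = Tuple[int, int, int]


def is_nae_solution(clauses: Iterable[Clause], assignment: Dict[int, bool]) -> bool:
    """True if every clause contains at least one True and one False variable."""
    return all(len({assignment[j] for j in clause}) == 2 for clause in clauses)


# Illustrative instance with clauses (l_1, l_2, l_3) and (l_2, l_3, l_4).
clauses = [(1, 2, 3), (2, 3, 4)]
good = {1: True, 2: False, 3: True, 4: True}
bad = {1: True, 2: True, 3: True, 4: False}  # first clause is all True
```

Negating every variable of a valid assignment yields another valid one, so which truth value is mapped to which genome label is a matter of convention in such a construction.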
We now construct the b-labeling b (and b L ) and the mappings e V and e E as follows: Let e V = e . Let e E (x, y) = A for every edge (x, y) of G.
For all j, 1 ≤ j ≤ m , such that ℓ j is True (resp. False) in TA, we set b(x) = 0 (resp. b(x) = 1 ) for each node x of the left subtree of U j .
Notice that for each σ ∈ Σ \ {S ij | 1 ≤ i ≤ k, 1 ≤ j ≤ 6} , we have set the genome label of exactly one of the two leaves of G for which the species label is σ . For each σ ∈ Σ \ {S ij | 1 ≤ i ≤ k, 1 ≤ j ≤ 6} , we then set the genome label of the leaf with species label σ whose genome label has not been set yet to 1 − i where i is the genome label of the other leaf with species label σ.
For each node x on the path from the parent of r(T 1 ) to r(G), we set b(x) = 0 . We set b(r(T i )) = 0 for 1 ≤ i ≤ k and we set b(r(U j )) = 0 for 1 ≤ j ≤ m.
Therefore, there is no EGTL event on edges that are not in the subtrees U j ( 1 ≤ j ≤ m ) or T i ( 1 ≤ i ≤ k ), as all the nodes connected by those edges are labeled by 0.
We now show that no EGTL event is required in the subtree U j of G, for 1 ≤ j ≤ m . By construction, all the nodes in the left subtree of U j have the same genome label i ( i ∈ {0, 1} ) and the node in the right subtree of U j has the genome label 1 − i . Thus, b(r(U j ) l ) ≠ b(r(U j ) r ) . Notice that r(U j ) is a duplication node in R DL and recall that b(r(U j )) = 0 . We then set e V (r(U j )) = EGT which is a transfer from 0 to 1. Therefore, there is no EGTL event in the subtree U j .
We now show that exactly one EGTL event is required in the subtree T i of G, for 1 ≤ i ≤ k . Notice that for any clause C i = (x ∨ y ∨ z) ∈ C , x, y and z cannot all be equal to each other in TA (because TA is a solution of the instance) and so, by construction, the genome labels of x i , y i and z i in T i are not all equal to each other. Without loss of generality, assume that b(x i ) = 0 , b(y i ) = 1 and b(z i ) = 0 (the other possible cases are very similar). Then, the b-labeling of T i shown in Fig. 4 is correct and requires exactly one EGTL event.
We set e E (x, y) = P where (x, y) is the edge with a triangle on it in the tree above. We also set e V (lca T i ({x i , S i4 })) = EGT , e V (lca T i ({y i , S i5 })) = EGT and e V (lca T i ({z i , S i6 })) = EGT (those are the nodes represented by a triangle in the tree above). We can do so because those nodes are duplication nodes in R DL .
There are then exactly k EGTL events in R DLE . Thus, the cost of R DLE is DL(G, S) + k and C(R DLE ) ≤ Cost.
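The cost accounting of this direction of the proof can be summarized as follows. This is only a restatement of the steps above; the symbol #EGTL (the number of EGTL events of a reconciliation) is a notation introduced here for illustration, and m and k are the numbers of variables and clauses of the instance.

```latex
\begin{align*}
C(R_{DLE}) &= DL(G,S) + \#\mathrm{EGTL}(R_{DLE}) \\
           &= DL(G,S)
              + \underbrace{0}_{\text{edges outside all } U_j \text{ and } T_i}
              + \underbrace{m \cdot 0}_{U_1,\dots,U_m}
              + \underbrace{k \cdot 1}_{T_1,\dots,T_k} \\
           &= DL(G,S) + k \;\le\; \mathit{Cost}.
\end{align*}
```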
For each leaf x of G, we set b L (x) = b(x) . Notice that the b-labeling b L we constructed is consistent with (M, I) as for each σ ∈ Σ , there is one leaf labeled σ whose genome label is 1 and one leaf labeled σ whose genome label is 0, as required.
We then obtain a DLE-Reconciliation R DLE = ⟨G, s lca , b, e V , e E ⟩ of ⟨G, s L , b L ⟩ where b L is a b-labeling consistent with (M, I) for which C(R DLE ) ≤ Cost and we conclude that the instance I ′ of the DLE-BinL1 decision problem admits a DLE-Reconciliation of cost lower than or equal to Cost.
Note that, by construction, the instance of the DLE-BinL1 decision problem in the reduction contains a gene tree with no more than two leaves having the same species label. From this remark, and since Monotone NAE3SAT is NP-complete, Lemmas 12 and 13 lead to the result.

Proof of theorem 6
First observe that the One-DLE-BinL1 decision problem is in NP because the DLE-BinL1 decision problem is in NP and because we can verify the one-direction condition in polynomial time.
We show that the One-DLE-BinL1 decision problem is NP-complete by reduction from the Monotone one-in-three 3-satisfiability problem (Monotone 1-in-3-SAT problem) defined as follows (monotone meaning that no variable is negated in the clauses):
Monotone 1-in-3-SAT:
Instance: A set of clauses C = (C 1 ∧ C 2 ∧ · · · ∧ C k ) on a finite set Ł = {ℓ 1 , ℓ 2 , . . . , ℓ m } of variables where each C i , 1 ≤ i ≤ k , is a clause of the form (x ∨ y ∨ z) with {x, y, z} ⊆ Ł;
Question: Is there a truth assignment satisfying C such that exactly one literal in each clause is set to True?
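To make the difference between the two source problems concrete, the following minimal sketch checks both clause conditions by brute force: "exactly one True" (Monotone 1-in-3-SAT) and "not all equal" (Monotone NAE3SAT). The function names and the encoding of clauses as tuples of variable names are assumptions made for this illustration and are not part of the reduction; the search is exponential and only meant for toy instances.

```python
from itertools import product

def exactly_one_true(clause, assignment):
    # 1-in-3 condition: exactly one variable of the clause is True.
    return sum(assignment[v] for v in clause) == 1

def not_all_equal(clause, assignment):
    # NAE condition: the clause contains both a True and a False variable.
    return len({assignment[v] for v in clause}) == 2

def find_assignment(clauses, variables, clause_ok):
    # Exhaustive search over all 2^m truth assignments (toy sizes only).
    for bits in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        if all(clause_ok(c, assignment) for c in clauses):
            return assignment
    return None

# Hypothetical toy instances (not taken from the paper).
variables = ["a", "b", "c", "d"]
easy = [("a", "b", "c"), ("a", "b", "d")]
hard = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("b", "c", "d")]

one_in_three_easy = find_assignment(easy, variables, exactly_one_true)
one_in_three_hard = find_assignment(hard, variables, exactly_one_true)
nae_hard = find_assignment(hard, variables, not_all_equal)
```

On a clause over three distinct variables, an assignment with exactly one True variable is also not-all-equal, but the converse fails: a clause with two True variables is accepted by the NAE condition and rejected by the 1-in-3 condition. This is the case that the restricted proof has to treat separately.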
Given an instance I = (C, Ł) of the Monotone 1-in-3-SAT problem, we compute, in polynomial time, a corresponding instance I ′ = (⟨G, s L ⟩, (M, I), S, Cost) of the One-DLE-BinL1 decision problem. The corresponding instance I ′ is the same as in the proof that the DLE-BinL1 decision problem is NP-complete (see "Complexity of the dle-binl and dle-binl1 Problems" section).
We next show that I is a satisfiable instance of the Monotone 1-in-3-SAT problem if (Lemma 14) and only if (Lemma 15) its corresponding instance I ′ of the One-DLE-BinL1 decision problem admits a DLE-Reconciliation of cost lower than or equal to Cost.

Lemma 14
Let I be an unsatisfiable instance of the Monotone 1-in-3-SAT problem. Then its corresponding instance I ′ of the One-DLE-BinL1 decision problem does not admit a DLE-Reconciliation of cost lower than or equal to Cost.
Proof The proof is similar to the proof of Lemma 12. All that is left to add to the proof for this restricted version is to show the following: If for a given clause C i = (x ∨ y ∨ z) ∈ C two of b(x i ) , b(y i ) , b(z i ) are equal to 1 and the other one is equal to 0 (corresponding to a clause for which two variables are set to True and one variable is set to False), then the corresponding subtree T i will contain at least 2 EGTL events. This is the case, as the only way to have only one EGTL event in the following subtrees of T i is to have an EGTL that transfers from 1 to 0, which is not allowed here (recall that there can be no EGT event in those subtrees because there is no duplication node in the DL-Reconciliation of these subtrees with S):